Venice is drowning - final report

Abstract

The main objective of this project is to analyze tide-gauge measurements from the Venice lagoon, producing predictive models whose performance is evaluated on forecast horizons ranging from one hour up to one week.

For this purpose, three models, spanning both linear and machine-learning approaches, are tested:

  • ARIMA (AutoRegressive Integrated Moving Average);
  • UCM (Unobserved Component Models);
  • LSTM (Long Short-Term Memory).

Datasets

Two datasets are the basis for the project pipeline:

  • the “main” dataset contains the tide-level measurements (in cm) in the Venice lagoon, relative to a fixed reference level and recorded by a sensor, between 1983 and 2018;
  • a second dataset holds meteorological variables, namely rainfall in mm, wind direction at 10 meters in degrees, and wind speed at 10 meters in meters per second, for the period between 2000 and 2019.

The tide-level dataset is assembled from the individual historical datasets made public by the city of Venice, in particular by the Centro Previsioni e Segnalazioni Maree. The meteorological data, instead, have been provided on request by ARPA Veneto.

All the preprocessing operations covering parsing, inspection, and the final union of the cited datasets are available in the following scripts:

  • parsing_tides_data builds the tidal dataset, importing and unifying each single annual dataset;
  • inspection contains a series of preliminary inspections of the aforementioned data;
  • preprocess_weather_data_2000_2019 contains the preprocessing operations for the weather-related dataset;
  • parsing_tides_weather summarizes the procedure implemented to deal with missing data in the weather dataset, and contains the merging operation producing the final weather dataset.

As a deliberate choice, for time and computational reasons, only the data ranging from 2010 to 2018 are kept after preprocessing.
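The merge and restriction steps described above can be sketched with pandas; the column names and sample values below are hypothetical stand-ins, and only the 2010–2018 restriction mirrors the report:

```python
import pandas as pd

# Hypothetical hourly tide levels (cm); the real data span 1983-2018.
tides = pd.DataFrame(
    {"level": [52, 48, 45, 50]},
    index=pd.date_range("2018-12-17 23:00", periods=4, freq="h"),
)

# Hypothetical weather records on the same hourly grid.
weather = pd.DataFrame(
    {"rain_mm": [0.0, 0.2, 0.0, 1.4], "wind_speed": [3.1, 2.8, 4.0, 5.2]},
    index=pd.date_range("2018-12-17 23:00", periods=4, freq="h"),
)

# Outer join on the timestamp index, then keep only 2010-2018,
# the window the report restricts the analysis to.
merged = tides.join(weather, how="outer")
merged = merged.loc["2010":"2018"]
```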

Data inspection

During the preprocessing phase, some descriptive visualizations of the main time series are produced in order to inspect its characteristics.

Figure 1
Figure 1: Time series visualization with autocorrelation and partial autocorrelation plots


Figure 2
Figure 2: Time series distribution

Figure 3
Figure 3: Output of the ADF test

Figure 4
Figure 4: Visualization of stationarity in variance


Figure 5
Figure 5: Frequency visualization using periodogram


Models

As anticipated, the models focus on two areas: one more purely statistical, with linear models such as ARIMA and UCM, and one of machine learning, through the definition of an LSTM model. The preparation and implementation of the models are presented below; finally, a results section allows a quick comparison of the models' performance on a test set defined a priori. In this regard, it is worth highlighting the data used in each area:

  • for the linear models, the training set consists of the last six months of 2018, from July to December;
  • for the machine-learning model, given its capacity to handle more data at roughly constant computational cost, the training set covers the period between January 2010 and December 2018.

The test set, previously extracted, refers to the last two weeks of December 2018, i.e. from 17/12/2018 23:00:00 to 31/12/2018 23:00:00.
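The split can be sketched with pandas datetime slicing; the series here is random, and only the boundary timestamps follow the report:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly index covering the linear models' window.
idx = pd.date_range("2018-06-25 00:00", "2018-12-31 23:00", freq="h")
series = pd.Series(np.random.default_rng(1).normal(size=idx.size), index=idx)

# The training window ends at 17/12/2018 23:00 (inclusive);
# the test window starts the following hour.
train = series.loc[:"2018-12-17 23:00"]
test = series.loc["2018-12-18 00:00":]
```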

Figure 6
Figure 6: Train and test data representation


With reference to the linear models, two strategies are implemented: the former consists in integrating the meteorological variables with the lunar motion, while the latter in extracting the principal periodic components using oce, an R package that helps oceanographers by providing functions to read oceanographic data files.

Regarding the first strategy, after processing the meteorological data as previously mentioned, the lunar motion is tracked using PyEphem, an astronomy library that provides basic astronomical computations for the Python programming language. Given a date and a location on the Earth’s surface, it can compute the positions of the Sun and Moon, of the planets and their moons, and of any asteroids, comets, or Earth satellites whose orbital elements the user can provide. To track the lunar motion, all we have to do is select the period of interest and the coordinates representing Venice.

Figure 7: Interactive plot representing lunar motion between 2010 and 2018


The second strategy, instead, as anticipated, concerns the principal periodic components that can be extracted from a sea-level time series and used as regressors for the tide-level series. The oce package provides a function called tidem that fits a model in terms of sine and cosine components at the indicated tidal frequencies, with the amplitude and phase calculated from the resulting coefficients on the sine and cosine terms. tidem can extract up to 69 components, but we focus on 8 of them, in particular:

  • M2, main lunar semi-diurnal with a period of ~12 hours;
  • S2, main solar semi-diurnal (~12 hours);
  • N2, lunar-elliptic semi-diurnal (~13 hours);
  • K2, lunar-solar semi-diurnal (~12 hours);
  • K1, lunar-solar diurnal (~24 hours);
  • O1, main lunar diurnal (~26 hours);
  • SA, solar annual (~24*365 hours);
  • P1, main solar diurnal (24 hours).
Figure 8: Interactive filtering plot for the extracted components
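Although the report uses R's oce, the sine-and-cosine formulation behind tidem can be illustrated in Python; the constituent subset, periods, and synthetic signal below are assumptions for the example:

```python
import numpy as np

t = np.arange(24 * 60, dtype=float)  # 60 days of hourly samples

# Tidal constituent periods in hours (a subset of tidem's 69 components).
periods = {"M2": 12.4206, "S2": 12.0, "K1": 23.9345, "O1": 25.8193}

# Synthetic tide: an M2-dominated signal plus noise.
rng = np.random.default_rng(2)
y = 40 * np.cos(2 * np.pi * t / 12.4206 + 0.3) + rng.normal(0, 3, t.size)

# Design matrix with a sine and a cosine column per constituent,
# mirroring tidem's formulation; amplitudes and phases come from the
# least-squares coefficients.
X = np.column_stack(
    [f(2 * np.pi * t / p) for p in periods.values() for f in (np.sin, np.cos)]
)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Amplitude of M2 = sqrt(a_sin^2 + a_cos^2) from its coefficient pair.
m2_amplitude = np.hypot(coef[0], coef[1])
```

Because each column is a pure function of time, the same matrix can be evaluated over future timestamps, which is what makes these harmonics usable as forecast regressors.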

ARIMA

Both linear models use the data between 25/06/2018 00:00:00 and 17/12/2018 23:00:00 as training set. This choice is driven by the need to keep the models' fitting time manageable, since using more data considerably lengthens the fit. Both ARIMA and UCM are implemented in R, in particular with the forecast and KFAS packages.

As a first approach to the forecasting task, we train two ARIMA models: the former uses as regressors the meteorological data provided by ARPA Veneto together with the lunar motion obtained through PyEphem, while the latter uses the 8 harmonics extracted with oce and tidem.

Starting from the first model, it is worth noting that the meteorological data have been standardized and that the lunar motion, as seen in the literature, is taken in the following form:

\[\begin{equation} lunar\_var = \frac{1}{lunar\_motion^2} \end{equation}\]
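A minimal numpy sketch of these two transformations, with invented sample values:

```python
import numpy as np

# Hypothetical raw regressors: rainfall (mm) and Earth-Moon distance (AU).
rain = np.array([0.0, 0.2, 1.4, 0.0, 0.6])
lunar_motion = np.array([0.00257, 0.00256, 0.00254, 0.00251, 0.00249])

# Standardization of the meteorological variable (zero mean, unit sd).
rain_std = (rain - rain.mean()) / rain.std()

# Lunar regressor in the inverse-square form of the equation above.
lunar_var = 1.0 / lunar_motion**2
```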

The initial impact of the variables is substantially negligible: as can be seen in fig. 9, the correlations and partial correlations maintain their magnitude and the fit on the training data is weak.

Figure 9
Figure 9: First output from ARIMA 1


After several attempts, guided by the lags shown in the ACF and PACF plots in combination with the AICc and the Mean Absolute Percentage Error (MAPE), a highly parameterized model is reached, of the form (3,1,3)(1,1,3)[24]. Although the autocorrelation has not been completely absorbed, the Box-Ljung test indicates that it is no longer present for the first few hours. The fit on the training set also improves considerably, and the performance on the test set proves quite good, as illustrated below. The autocorrelation and fitting performance are visible in fig. 10.

Figure 10 Figure 10b
Figure 10: Final output from ARIMA 1


As anticipated, the second ARIMA model uses as input regressors the 8 harmonics extracted with tidem from oce. Since they are functional forms based solely on time, we can obtain them also for the forecast horizon. An example of these harmonics is shown in fig. 8 in an interactive fashion. The effect of the harmonics is already visible in the basic model, that is, the one without autoregressive or moving-average components.

Figure 11
Figure 11: Output from ARIMA 2


UCM

LSTM

[Insert here the LSTM procedure carried out]
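The authors' LSTM procedure is not included in this draft. As a generic, hypothetical illustration of the usual preliminary step, an hourly series is typically reshaped into supervised (lookback, horizon) windows before being fed to an LSTM:

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Slice a 1-D series into supervised (lookback, horizon) pairs."""
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i : i + lookback])            # past values
        y.append(series[i + lookback : i + lookback + horizon])  # targets
    return np.array(X), np.array(y)

# e.g. 48 past hours used to predict the next hour
X, y = make_windows(np.arange(100.0), lookback=48, horizon=1)
```

The window length and horizon here are illustrative choices, not the ones used in the project.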

Dario Bertazioli, Fabrizio D’Intinosante

2020-01-25